⚡️ Speed up function `get_column_tolerance` by 13% #32

codeflash-ai · 2025-11-19T10:30:03Z

📄 13% (0.13x) speedup for `get_column_tolerance` in `datacompy/base.py`

⏱️ Runtime : 1.16 milliseconds → 1.03 milliseconds (best of 151 runs)

📝 Explanation and details

The optimization replaces a nested `.get()` call with explicit `in` checks and direct dictionary access, for a speedup of roughly 13% (1.16 ms → 1.03 ms across the measured runs).

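**Key Changes:**

- **Original**: `tol_dict.get(column, tol_dict.get("default", 0.0))` - always performs two dictionary lookups and two method calls, because the inner `.get()` is evaluated before the outer one runs
- **Optimized**: Uses `if column in tol_dict` followed by direct `tol_dict[column]` access - skips the `"default"` lookup when the column is present and avoids `.get()` method call overhead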
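The optimized body is not reproduced in this description, so the following is only a minimal sketch of the two variants as the bullets above describe them. The `_original` and `_optimized` suffixes are hypothetical names used for side-by-side comparison; the real function is `datacompy.base.get_column_tolerance`, and its actual body may differ in detail.

```python
from typing import Any, Dict


def get_column_tolerance_original(column: Any, tol_dict: Dict[Any, Any]) -> Any:
    # Nested .get(): the inner call is an argument of the outer call,
    # so "default" is looked up on every invocation.
    return tol_dict.get(column, tol_dict.get("default", 0.0))


def get_column_tolerance_optimized(column: Any, tol_dict: Dict[Any, Any]) -> Any:
    # Assumed shape of the rewrite: explicit membership checks
    # followed by direct subscript access.
    if column in tol_dict:
        return tol_dict[column]
    if "default" in tol_dict:
        return tol_dict["default"]
    return 0.0
```

Both variants should return the same value for any dictionary, including when the stored value is `None` or a non-float, since each returns whatever is stored under the first matching key.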

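**Why It's Faster:**

1. **No eager fallback lookup**: The original evaluates the inner `tol_dict.get("default", 0.0)` on every call, even when the column exists and the fallback value is never used
2. **Reduces method call overhead**: The `in` operator and direct dictionary access `tol_dict[column]` avoid the attribute lookup and call overhead of `.get()`
3. **Short-circuit evaluation**: When the column exists (common case), the function returns after the first check and never consults the `"default"` key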
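One way to see which keys each variant touches is to wrap the dictionary in a subclass that records every query. This is an illustrative sketch: `CountingDict`, `nested_get`, and `explicit_checks` are hypothetical names, and `explicit_checks` is an assumed reconstruction of the optimized logic rather than the code in the PR.

```python
from typing import Any, List


class CountingDict(dict):
    """dict subclass that records every key it is asked about."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self.touched: List[Any] = []
        super().__init__(*args, **kwargs)

    def get(self, key, default=None):
        self.touched.append(key)
        return super().get(key, default)

    def __contains__(self, key):
        self.touched.append(key)
        return super().__contains__(key)

    def __getitem__(self, key):
        self.touched.append(key)
        return super().__getitem__(key)


def nested_get(column, tol_dict):
    # Original form: the inner .get() runs first, unconditionally.
    return tol_dict.get(column, tol_dict.get("default", 0.0))


def explicit_checks(column, tol_dict):
    # Assumed optimized form: stop as soon as the column is found.
    if column in tol_dict:
        return tol_dict[column]
    if "default" in tol_dict:
        return tol_dict["default"]
    return 0.0


d = CountingDict({"colA": 0.1, "default": 0.5})
nested_get("colA", d)
print(d.touched)  # ['default', 'colA']

d.touched.clear()
explicit_checks("colA", d)
print(d.touched)  # ['colA', 'colA']
```

On a hit, both variants query the dictionary twice. The difference is that the nested form spends one of those queries on `"default"` and goes through two method calls, while the explicit form uses an operator and a subscript on the same key.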

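**Performance Characteristics:**

- **Best case** (column exists): typically 14-23% faster - avoids the nested `.get()` entirely
- **Default case** (column missing, default exists): typically 3-36% slower - the miss on `column` is followed by a second `in` check and a subscript for `"default"`, which costs more than the original's two `.get()` calls
- **No match case** (neither column nor default): typically 2-16% faster - eliminates unnecessary method calls

Individual timings are in the hundreds of nanoseconds, so single measurements in the tables below can fall outside these ranges.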
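The three cases above can be reproduced locally with a small `timeit` harness. This is a rough sketch, not the Codeflash benchmark: `nested_get`, `explicit_checks`, and `benchmark` are hypothetical names, the optimized form is an assumed reconstruction, and the lambda wrapper adds constant overhead to both sides, so absolute numbers will not match the report.

```python
import timeit
from typing import Dict, Tuple


def nested_get(column, tol_dict):
    return tol_dict.get(column, tol_dict.get("default", 0.0))


def explicit_checks(column, tol_dict):
    if column in tol_dict:
        return tol_dict[column]
    if "default" in tol_dict:
        return tol_dict["default"]
    return 0.0


def benchmark(number: int = 100_000, repeat: int = 3) -> Dict[str, Tuple[float, float]]:
    """Return {scenario: (original_ns_per_call, optimized_ns_per_call)}."""
    with_default = {"colA": 0.1, "colB": 0.2, "default": 0.5}
    no_default = {"colA": 0.1, "colB": 0.2}
    scenarios = {
        "column exists": ("colA", with_default),
        "default fallback": ("colX", with_default),
        "no match": ("colX", no_default),
    }
    results = {}
    for name, (column, d) in scenarios.items():
        # Take the minimum over repeats to reduce scheduling noise.
        t_old = min(timeit.repeat(lambda: nested_get(column, d), number=number, repeat=repeat))
        t_new = min(timeit.repeat(lambda: explicit_checks(column, d), number=number, repeat=repeat))
        results[name] = (t_old / number * 1e9, t_new / number * 1e9)
    return results


if __name__ == "__main__":
    for name, (old_ns, new_ns) in benchmark().items():
        print(f"{name:>16}: {old_ns:7.1f} ns -> {new_ns:7.1f} ns")
```

Results will vary by interpreter version and machine; the point is only to compare the direction of the change in each scenario.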

**Impact on Workloads:**

Based on the function references, this function is called in hot paths within `datacompy.core._intersect_compare()` and `all_mismatch()`, methods that process every column during dataframe comparison operations. These methods likely encounter columns with explicit entries more often than missing ones, so the optimization should help in typical comparison workflows where most columns have explicit tolerance values. Workloads that rely mainly on the `"default"` entry take the slower fallback path and may see no gain.
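How many calls must hit an explicit column entry for the change to pay off can be estimated with simple arithmetic. The sketch below uses one pair of timings from this report (`test_tol_dict_with_int_values`: 433ns → 398ns on a hit, 275ns → 342ns on a default fallback); `break_even_hit_fraction` is a hypothetical helper, and a single noisy measurement pair makes the result illustrative only.

```python
def break_even_hit_fraction(gain_per_hit_ns: float, loss_per_fallback_ns: float) -> float:
    """Fraction of calls that must find an explicit column entry to break even.

    Solves p * gain == (1 - p) * loss for p.
    """
    return loss_per_fallback_ns / (gain_per_hit_ns + loss_per_fallback_ns)


# Timings copied from test_tol_dict_with_int_values in this report.
gain = 433 - 398  # ns saved per call when the column is present
loss = 342 - 275  # ns added per call when falling back to "default"

threshold = break_even_hit_fraction(gain, loss)
print(f"break-even hit fraction: {threshold:.1%}")
```

With these particular numbers, roughly two thirds of calls need to hit an explicit entry before the rewrite is a net win; other rows in the tables would give different thresholds.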

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | ✅ 26 Passed |
| 🌀 Generated Regression Tests | ✅ 4573 Passed |
| ⏪ Replay Tests | ✅ 510 Passed |
| 🔎 Concolic Coverage Tests | ✅ 1 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `test_base.py::test_get_column_tolerance_column_is_default` | 446ns | 438ns | 1.83%✅ |
| `test_base.py::test_get_column_tolerance_default` | 444ns | 505ns | -12.1%⚠️ |
| `test_base.py::test_get_column_tolerance_empty_dict` | 484ns | 458ns | 5.68%✅ |
| `test_base.py::test_get_column_tolerance_exact_match` | 840ns | 751ns | 11.9%✅ |
| `test_base.py::test_get_column_tolerance_no_default` | 481ns | 441ns | 9.07%✅ |

🌀 Generated Regression Tests and Runtime

# imports
from datacompy.base import get_column_tolerance

# unit tests

# 1. Basic Test Cases


def test_column_explicitly_in_dict():
    # Test that an explicitly listed column returns its value
    tol_dict = {"colA": 0.1, "colB": 0.2, "default": 0.5}
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 491ns -> 429ns (14.5% faster)
    codeflash_output = get_column_tolerance(
        "colB", tol_dict
    )  # 280ns -> 231ns (21.2% faster)


def test_column_not_in_dict_but_default_exists():
    # Test that a column not listed returns the default value
    tol_dict = {"colA": 0.1, "default": 0.5}
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 488ns -> 486ns (0.412% faster)


def test_column_and_default_not_in_dict():
    # Test that if neither the column nor default is present, returns 0.0
    tol_dict = {"colA": 0.1, "colB": 0.2}
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 471ns -> 410ns (14.9% faster)


def test_column_is_default_key():
    # Test that if column is "default", returns the value for "default"
    tol_dict = {"default": 0.7, "colA": 0.1}
    codeflash_output = get_column_tolerance(
        "default", tol_dict
    )  # 442ns -> 394ns (12.2% faster)


def test_column_and_default_same_value():
    # Test that if column not present, returns default even if default is 0.0
    tol_dict = {"default": 0.0}
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 507ns -> 490ns (3.47% faster)


# 2. Edge Test Cases


def test_empty_dict():
    # Test with an empty dictionary
    tol_dict = {}
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 517ns -> 445ns (16.2% faster)


def test_column_is_empty_string():
    # Test with column as empty string
    tol_dict = {"": 0.33, "default": 0.44}
    codeflash_output = get_column_tolerance(
        "", tol_dict
    )  # 479ns -> 405ns (18.3% faster)
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 296ns -> 383ns (22.7% slower)


def test_default_is_none():
    # Test if default is None (should return None, but function expects float, so test behavior)
    tol_dict = {"default": None}
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 453ns -> 467ns (3.00% slower)


def test_column_is_none():
    # Test if column is None (should not match any key, returns default or 0.0)
    tol_dict = {"colA": 0.1, "default": 0.2}
    codeflash_output = get_column_tolerance(
        None, tol_dict
    )  # 520ns -> 525ns (0.952% slower)


def test_tol_dict_has_non_float_values():
    # Test with non-float values in dict
    tol_dict = {"colA": "not_a_float", "default": 0.1}
    # Should return the string if column matches, even though it's not a float
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 436ns -> 399ns (9.27% faster)
    # Should return default as float if column doesn't match
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 294ns -> 379ns (22.4% slower)


def test_tol_dict_has_nested_dict():
    # Test with a nested dict as value
    tol_dict = {"colA": {"nested": "dict"}, "default": 0.1}
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 436ns -> 411ns (6.08% faster)
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 285ns -> 364ns (21.7% slower)


def test_tol_dict_with_int_values():
    # Test with integer values
    tol_dict = {"colA": 1, "default": 2}
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 433ns -> 398ns (8.79% faster)
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 275ns -> 342ns (19.6% slower)


def test_tol_dict_with_negative_values():
    # Test with negative values
    tol_dict = {"colA": -0.1, "default": -0.2}
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 433ns -> 405ns (6.91% faster)
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 271ns -> 312ns (13.1% slower)


def test_tol_dict_with_zero_values():
    # Test with zero values
    tol_dict = {"colA": 0.0, "default": 0.0}
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 432ns -> 364ns (18.7% faster)
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 296ns -> 337ns (12.2% slower)


def test_tol_dict_with_duplicate_keys():
    # A dict literal with a repeated key keeps only the last value, so "colA" maps to 0.2
    tol_dict = {"colA": 0.1, "colA": 0.2, "default": 0.3}
    # The lookup should see the surviving value for "colA"
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 435ns -> 379ns (14.8% faster)


def test_tol_dict_with_special_characters_in_column():
    # Test with special characters in column name
    tol_dict = {"col$A": 0.1, "default": 0.2}
    codeflash_output = get_column_tolerance(
        "col$A", tol_dict
    )  # 466ns -> 388ns (20.1% faster)
    codeflash_output = get_column_tolerance(
        "col@B", tol_dict
    )  # 317ns -> 401ns (20.9% slower)


def test_tol_dict_with_case_sensitivity():
    # Test case sensitivity
    tol_dict = {"ColA": 0.1, "cola": 0.2, "default": 0.3}
    codeflash_output = get_column_tolerance(
        "ColA", tol_dict
    )  # 426ns -> 368ns (15.8% faster)
    codeflash_output = get_column_tolerance(
        "cola", tol_dict
    )  # 260ns -> 232ns (12.1% faster)
    codeflash_output = get_column_tolerance(
        "COLA", tol_dict
    )  # 217ns -> 290ns (25.2% slower)


def test_tol_dict_with_spaces_in_column():
    # Test column names with spaces
    tol_dict = {"col A": 0.1, "default": 0.2}
    codeflash_output = get_column_tolerance(
        "col A", tol_dict
    )  # 418ns -> 344ns (21.5% faster)
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 301ns -> 342ns (12.0% slower)


def test_tol_dict_with_numeric_column_names():
    # Test numeric column names
    tol_dict = {123: 0.1, "default": 0.2}
    codeflash_output = get_column_tolerance(
        123, tol_dict
    )  # 502ns -> 484ns (3.72% faster)
    codeflash_output = get_column_tolerance(
        "colA", tol_dict
    )  # 320ns -> 391ns (18.2% slower)


# 3. Large Scale Test Cases


def test_large_dict_all_explicit():
    # Test with a large dictionary where all columns are explicitly listed
    tol_dict = {f"col{i}": float(i) for i in range(1000)}
    for i in range(1000):
        codeflash_output = get_column_tolerance(
            f"col{i}", tol_dict
        )  # 212μs -> 185μs (14.4% faster)
    # Test a column not present
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 231ns -> 310ns (25.5% slower)


def test_large_dict_with_default():
    # Test with a large dictionary and a default value
    tol_dict = {f"col{i}": float(i) for i in range(999)}
    tol_dict["default"] = 3.14159
    for i in range(999):
        codeflash_output = get_column_tolerance(
            f"col{i}", tol_dict
        )  # 214μs -> 185μs (15.8% faster)
    # Test a column not present
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 248ns -> 391ns (36.6% slower)


def test_large_dict_all_default():
    # Test with a large dictionary where only default is present
    tol_dict = {"default": 2.71828}
    for i in range(1000):
        codeflash_output = get_column_tolerance(
            f"col{i}", tol_dict
        )  # 196μs -> 175μs (11.8% faster)


def test_large_dict_no_default():
    # Test with a large dictionary and no default value
    tol_dict = {f"col{i}": float(i) for i in range(1000)}
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 582ns -> 570ns (2.11% faster)


def test_large_dict_with_mixed_types():
    # Test with a large dictionary with mixed value types
    tol_dict = {f"col{i}": i if i % 2 == 0 else float(i) for i in range(500)}
    tol_dict["default"] = "mixed"
    for i in range(500):
        expected = i if i % 2 == 0 else float(i)
        codeflash_output = get_column_tolerance(
            f"col{i}", tol_dict
        )  # 109μs -> 93.3μs (16.9% faster)
    codeflash_output = get_column_tolerance(
        "colX", tol_dict
    )  # 234ns -> 371ns (36.9% slower)


def test_large_dict_with_long_column_names():
    # Test with long column names
    tol_dict = {f"col{'x' * 50}{i}": float(i) for i in range(1000)}
    for i in range(1000):
        codeflash_output = get_column_tolerance(
            f"col{'x' * 50}{i}", tol_dict
        )  # 241μs -> 212μs (13.6% faster)
    codeflash_output = get_column_tolerance(
        "colY", tol_dict
    )  # 247ns -> 314ns (21.3% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

# imports
from datacompy.base import get_column_tolerance

# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------


def test_basic_column_present():
    # Test when the column is present in the dictionary
    tol_dict = {"col1": 0.01, "col2": 0.02}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 620ns -> 567ns (9.35% faster)
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 267ns -> 217ns (23.0% faster)


def test_basic_column_not_present_with_default():
    # Test when the column is not present but "default" is
    tol_dict = {"col1": 0.01, "default": 0.05}
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 490ns -> 536ns (8.58% slower)
    codeflash_output = get_column_tolerance(
        "col3", tol_dict
    )  # 278ns -> 312ns (10.9% slower)


def test_basic_column_not_present_no_default():
    # Test when the column is not present and no "default" exists
    tol_dict = {"col1": 0.01}
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 450ns -> 426ns (5.63% faster)
    codeflash_output = get_column_tolerance(
        "col3", tol_dict
    )  # 260ns -> 289ns (10.0% slower)


def test_basic_default_column_explicit():
    # Test when "default" is explicitly queried
    tol_dict = {"col1": 0.01, "default": 0.05}
    codeflash_output = get_column_tolerance(
        "default", tol_dict
    )  # 462ns -> 453ns (1.99% faster)


def test_basic_column_present_overrides_default():
    # Test that column-specific tolerance overrides "default"
    tol_dict = {"col1": 0.01, "default": 0.05}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 445ns -> 425ns (4.71% faster)


# ---------------------------
# Edge Test Cases
# ---------------------------


def test_edge_empty_dict():
    # Test with an empty dictionary
    tol_dict = {}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 497ns -> 402ns (23.6% faster)
    codeflash_output = get_column_tolerance(
        "default", tol_dict
    )  # 267ns -> 228ns (17.1% faster)


def test_edge_default_is_zero():
    # Test when "default" is explicitly zero
    tol_dict = {"default": 0.0}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 444ns -> 513ns (13.5% slower)
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 288ns -> 283ns (1.77% faster)


def test_edge_column_name_is_empty_string():
    # Test when column name is an empty string
    tol_dict = {"": 0.15, "default": 0.05}
    codeflash_output = get_column_tolerance(
        "", tol_dict
    )  # 471ns -> 420ns (12.1% faster)
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 291ns -> 404ns (28.0% slower)


def test_edge_column_name_is_none():
    # Test when column name is None
    tol_dict = {None: 0.2, "default": 0.05}
    codeflash_output = get_column_tolerance(
        None, tol_dict
    )  # 510ns -> 442ns (15.4% faster)
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 295ns -> 450ns (34.4% slower)


def test_edge_tol_dict_has_non_float_values():
    # Test when tol_dict contains non-float values
    tol_dict = {"col1": "0.1", "default": 0.05}
    # Should return the value as stored, even if not float
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 436ns -> 402ns (8.46% faster)
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 278ns -> 392ns (29.1% slower)


def test_edge_tol_dict_has_int_values():
    # Test when tol_dict contains integer values
    tol_dict = {"col1": 2, "default": 5}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 405ns -> 386ns (4.92% faster)
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 274ns -> 371ns (26.1% slower)


def test_edge_tol_dict_has_negative_values():
    # Test when tol_dict contains negative values
    tol_dict = {"col1": -0.01, "default": -0.05}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 407ns -> 389ns (4.63% faster)
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 301ns -> 385ns (21.8% slower)


def test_edge_tol_dict_has_nan_and_inf():
    # Test when tol_dict contains float('nan') and float('inf')
    tol_dict = {"col1": float("nan"), "col2": float("inf"), "default": 0.1}
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 453ns -> 369ns (22.8% faster)
    codeflash_output = get_column_tolerance(
        "col3", tol_dict
    )  # 318ns -> 387ns (17.8% slower)


def test_edge_tol_dict_has_multiple_defaults():
    # Test that key matching is case-sensitive: "Default" is not treated as "default"
    tol_dict = {"col1": 0.01, "default": 0.05, "Default": 0.1}
    codeflash_output = get_column_tolerance(
        "col2", tol_dict
    )  # 410ns -> 470ns (12.8% slower)


def test_edge_column_name_is_integer():
    # Test when column name is an integer
    tol_dict = {1: 0.5, "default": 0.05}
    codeflash_output = get_column_tolerance(
        1, tol_dict
    )  # 494ns -> 465ns (6.24% faster)
    codeflash_output = get_column_tolerance(
        2, tol_dict
    )  # 276ns -> 359ns (23.1% slower)


def test_edge_tol_dict_has_extra_keys():
    # Test when tol_dict has extra unrelated keys
    tol_dict = {"col1": 0.01, "col2": 0.02, "default": 0.05, "extra": 999}
    codeflash_output = get_column_tolerance(
        "col1", tol_dict
    )  # 422ns -> 417ns (1.20% faster)
    codeflash_output = get_column_tolerance(
        "col3", tol_dict
    )  # 303ns -> 400ns (24.2% slower)


# ---------------------------
# Large Scale Test Cases
# ---------------------------


def test_large_scale_many_columns():
    # Test with a dictionary containing 1000 columns
    tol_dict = {f"col{i}": i * 0.001 for i in range(1000)}
    tol_dict["default"] = -1.0
    # Test a few specific columns (first, middle, last)
    codeflash_output = get_column_tolerance(
        "col0", tol_dict
    )  # 544ns -> 529ns (2.84% faster)
    codeflash_output = get_column_tolerance(
        "col500", tol_dict
    )  # 288ns -> 287ns (0.348% faster)
    codeflash_output = get_column_tolerance(
        "col999", tol_dict
    )  # 204ns -> 345ns (40.9% slower)
    # Test a column not present
    codeflash_output = get_column_tolerance(
        "col1001", tol_dict
    )  # 232ns -> 299ns (22.4% slower)


def test_large_scale_all_default():
    # Test with a dictionary of 1000 unrelated keys and only "default"
    tol_dict = {f"foo{i}": i for i in range(1000)}
    tol_dict["default"] = 42.42
    # Should always return default for unknown column
    codeflash_output = get_column_tolerance(
        "not_in_dict", tol_dict
    )  # 479ns -> 508ns (5.71% slower)


def test_large_scale_no_default():
    # Test with a dictionary of 1000 unrelated keys and no "default"
    tol_dict = {f"foo{i}": i for i in range(1000)}
    # Should always return 0.0 for unknown column
    codeflash_output = get_column_tolerance(
        "not_in_dict", tol_dict
    )  # 456ns -> 416ns (9.62% faster)


def test_large_scale_column_is_none():
    # Test with a large dictionary and column name None
    tol_dict = {f"col{i}": i * 0.1 for i in range(1000)}
    tol_dict[None] = 123.456
    tol_dict["default"] = 0.0
    codeflash_output = get_column_tolerance(
        None, tol_dict
    )  # 580ns -> 473ns (22.6% faster)


def test_large_scale_column_is_int():
    # Test with integer column keys in large dict
    tol_dict = {i: float(i) for i in range(1000)}
    tol_dict["default"] = -999.0
    codeflash_output = get_column_tolerance(
        999, tol_dict
    )  # 597ns -> 601ns (0.666% slower)
    codeflash_output = get_column_tolerance(
        1001, tol_dict
    )  # 294ns -> 450ns (34.7% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from datacompy.base import get_column_tolerance


def test_get_column_tolerance():
    get_column_tolerance("", {})

⏪ Replay Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `test_pytest_teststest_snowflake_py_teststest_polars_py_teststest_sparktest_sql_spark_py_teststest_fuguete__replay_test_0.py::test_datacompy_base_get_column_tolerance` | 78.5μs | 73.8μs | 6.40%✅ |
| `test_pytest_teststest_sparktest_helper_py_teststest_fuguetest_fugue_polars_py_teststest_fuguetest_fugue_p__replay_test_0.py::test_datacompy_base_get_column_tolerance` | 75.5μs | 70.3μs | 7.39%✅ |

🔎 Concolic Coverage Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `codeflash_concolic_8h8xtkx8/tmpwed2y05m/test_concolic_coverage.py::test_get_column_tolerance` | 591ns | 565ns | 4.60%✅ |

To edit these changes, run `git checkout codeflash/optimize-get_column_tolerance-mi5v2ai5` and push.

codeflash-ai bot requested a review from mashraf-222 November 19, 2025 10:30

codeflash-ai bot added the `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: High` (Optimization Quality according to Codeflash) labels Nov 19, 2025
